47 research outputs found

Learning to embed semantic similarity for joint image-text retrieval

Author: Keller Yosi
Malali Noam
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 07/10/2022

We present a deep learning approach for learning the joint semantic embeddings of images and captions in a Euclidean space, such that the semantic similarity is approximated by the L2 distances in the embedding space. To that end, we introduce a metric learning scheme that uses multitask learning to learn the embedding of identical semantic concepts using a center loss. By introducing a differentiable quantization scheme into the end-to-end trainable network, we derive a semantic embedding of semantically similar concepts in Euclidean space. We also propose a novel metric learning formulation using an adaptive margin hinge loss that is refined during the training phase. The proposed scheme was applied to the MS-COCO, Flickr30K and Flickr8K datasets, and was shown to compare favorably with contemporary state-of-the-art approaches. Comment: in IEEE Transactions on Pattern Analysis and Machine Intelligence, 202

arXiv.org e-Print Archive
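
The abstract above describes a hinge loss over L2 distances in the embedding space. A minimal sketch of that general idea follows, assuming a standard triplet formulation. The function names are hypothetical, and the margin is passed in as a fixed argument because the abstract does not say how the margin is adapted during training. This is not the authors' formulation.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hinge_triplet_loss(anchor, positive, negative, margin):
    """Generic triplet hinge loss on L2 distances (illustrative assumption).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`.
    """
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(0.0, margin + gap)


# Toy 2-D embeddings, made up for illustration only.
image = [0.0, 0.0]
matching_caption = [3.0, 4.0]  # distance 5 from the image
other_caption = [6.0, 8.0]     # distance 10 from the image

loss_small_margin = hinge_triplet_loss(image, matching_caption, other_caption, margin=2.0)
loss_large_margin = hinge_triplet_loss(image, matching_caption, other_caption, margin=7.0)
```

With these toy values the small margin is already satisfied, so only the larger margin produces a positive loss.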

Camera Pose Auto-Encoders for Improving Pose Regression

Author: Keller Yosi
Shavit Yoli
Publication date: 12/07/2022

Absolute pose regressor (APR) networks are trained to estimate the pose of the camera given a captured image. They compute latent image representations from which the camera position and orientation are regressed. APRs provide a different tradeoff between localization accuracy, runtime, and memory, compared to structure-based localization schemes that provide state-of-the-art accuracy. In this work, we introduce Camera Pose Auto-Encoders (PAEs), multilayer perceptrons that are trained via a Teacher-Student approach to encode camera poses using APRs as their teachers. We show that the resulting latent pose representations can closely reproduce APR performance and demonstrate their effectiveness for related tasks. Specifically, we propose a light-weight test-time optimization in which the closest train poses are encoded and used to refine camera position estimation. This procedure achieves a new state-of-the-art position accuracy for APRs, on both the CambridgeLandmarks and 7Scenes benchmarks. We also show that train images can be reconstructed from the learned pose encoding, paving the way for integrating visual information from the train set at a low memory cost. Our code and pre-trained models are available at https://github.com/yolish/camera-pose-auto-encoders. Comment: Accepted to ECCV2

arXiv.org e-Print Archive
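
The abstract above mentions selecting the closest train poses at test time. The sketch below covers only that retrieval step, under the assumption that "closest" means smallest Euclidean distance between camera positions. The function name and the toy coordinates are hypothetical. How the retrieved poses are encoded and used for refinement is not described in the abstract and is not attempted here.

```python
import math


def closest_train_poses(estimated_position, train_positions, k):
    """Return indices of the k train positions nearest to the estimate.

    Nearness is plain Euclidean distance between 3-D positions, which is
    an assumption made for illustration.
    """
    ranked = sorted(
        range(len(train_positions)),
        key=lambda i: math.dist(estimated_position, train_positions[i]),
    )
    return ranked[:k]


# Toy positions, made up for illustration only.
initial_estimate = (0.0, 0.0, 0.0)
train_set = [(5.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]

nearest = closest_train_poses(initial_estimate, train_set, k=2)
```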

Paying Attention to Multiscale Feature Maps in Multimodal Image Matching

Author: Keller Yosi
Moreshet Aviad
Publication date: 20/03/2021

We propose an attention-based approach for multimodal image patch matching using a Transformer encoder attending to the feature maps of a multiscale Siamese CNN. Our encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific appearance-invariant image cues. We also introduce an attention-residual architecture, using a residual connection bypassing the encoder. This additional learning signal facilitates end-to-end training from scratch. Our approach is experimentally shown to achieve new state-of-the-art accuracy on both multimodal and single modality benchmarks, illustrating its general applicability. To the best of our knowledge, this is the first successful application of the Transformer encoder architecture to the multimodal image patch matching task.

arXiv.org e-Print Archive
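
The abstract above describes a residual connection that bypasses the attention encoder. A minimal sketch of that pattern follows, assuming single-head self-attention with identity query, key and value projections and a simple element-wise sum for the bypass. Both choices are illustrative assumptions, and the function names are hypothetical. The paper's actual architecture is not specified in the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections."""
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(dim)]
        )
    return attended


def attention_with_residual(tokens):
    """Add the encoder input back to the attention output (the bypass)."""
    attended = self_attention(tokens)
    return [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]


# Two identical toy feature vectors, made up for illustration only.
features = [[1.0, 2.0], [1.0, 2.0]]
output = attention_with_residual(features)
```

Because the bypass carries the input straight to the output, gradients can reach earlier layers without passing through the attention weights, which is the usual motivation for this kind of connection.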

Hierarchical Attention-based Age Estimation and Bias Estimation

Author: Hiba Shakediel
Keller Yosi
Publication date: 17/03/2021

In this work, we propose a novel deep-learning approach for age estimation based on face images. We first introduce a dual image augmentation-aggregation approach based on attention. This allows the network to jointly utilize multiple face image augmentations whose embeddings are aggregated by a Transformer-Encoder. The resulting aggregated embedding is shown to better encode the face image attributes. We then propose a probabilistic hierarchical regression framework that combines a discrete probabilistic estimate of age labels with a corresponding ensemble of regressors. Each regressor is specifically adapted and trained to refine the probabilistic estimate over a range of ages. Our scheme is shown to outperform contemporary schemes and provide a new state-of-the-art age estimation accuracy when applied to the MORPH II dataset for age estimation. Finally, we introduce a bias analysis of state-of-the-art age estimation results. Comment: 11 pages, 7 figures

arXiv.org e-Print Archive
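
The abstract above combines a discrete probability estimate over age labels with an ensemble of per-range regressors. One plausible way to fuse the two, sketched below, is a probability-weighted average of the regressor outputs. This combination rule is an assumption made for illustration, the function name is hypothetical, and the abstract does not state the rule the authors use.

```python
def combine_age_estimates(bin_probabilities, regressor_outputs):
    """Probability-weighted average of per-range age estimates.

    bin_probabilities: one non-negative weight per age range.
    regressor_outputs: the refined age predicted by each range's regressor.
    """
    if len(bin_probabilities) != len(regressor_outputs):
        raise ValueError("expected one regressor output per age range")
    total = sum(bin_probabilities)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    weighted = sum(p * r for p, r in zip(bin_probabilities, regressor_outputs))
    return weighted / total


# Toy values, made up for illustration only.
probabilities = [0.25, 0.5, 0.25]
refined_ages = [20.0, 30.0, 40.0]

estimated_age = combine_age_estimates(probabilities, refined_ages)
```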